Combine narrative with numbers for effective storytelling in R

Martin Frigaard

Setup

Link to scenario: https://www.katacoda.com/orm-mfrigaard/scenarios/03-effective-storytelling

The index.json configuration file is set using the following environment:

"environment": {
  "uilayout": "editor-terminal-split"
},
"backend": {
  "imageid": "rlang:3.4"
}

Outline

Below are the 20 step files (plus intro and finish) in the scenario.

Objectives

The objectives for this scenario are:

  1. recognize the needs of your audience (data literacy, level of expertise, etc.)

  2. check and communicate data quality with stakeholders

  3. identify the correct data visualization (single variable, bivariate, and multivariate graphs) based on the data

  4. incorporate feedback from stakeholders/audience into graphs

  5. design visualizations with the appropriate detail and annotations that inform (and do not mislead) the audience

The learners

The learners I’m expecting to be participating in this course are:

  • Jessica will take this class on her own time for professional development.

  • Peter will take this course over a two-day weekend because he needs to complete a project.

  • Bruce is using these scenarios to supplement a semester-long undergraduate course on R.

  • Jane has been told to take this course for her job because her team is using R.

intro

  • included in intro.md?

Welcome!

Welcome to ‘Combine narrative with numbers for effective storytelling in R’! In this scenario, we will cover how to build data visualizations that effectively communicate and engage your audience.

Now that we have some experience with data wrangling with tidyr and dplyr, and data visualization with ggplot2, we can put these tools together to get your point across to stakeholders and audiences.

What makes a bad graphic?

Bad graphs aren’t just ugly; they’re misleading. A chart can have pretty colors and novel font styling, but that can’t make up for inaccurately presenting data. Mislabeling axes, using inappropriate scales or labels, and including unnecessary elements (‘chart junk’) are common characteristics of bad graphics.

We’ll be using the terms ‘graph’, ‘figure,’ and ‘chart’ interchangeably throughout this scenario.

Claus Wilke describes the difference between ugly, bad, and wrong graphs in his excellent text, Fundamentals of Data Visualization.

  • An ugly graph isn’t aesthetically appealing, but the data it presents are clear and informative
  • A bad graph fails to communicate the information it contains because of poor design
  • A graph is wrong if it’s mathematically incorrect (the underlying calculations, representations, or transformations are inaccurate).

We want to avoid making ugly, bad, and wrong graphs.

What makes a great graphic?

Have you ever heard a joke and thought, “it’s funny because it’s true”?

We all know a great data visualization when we see it, but can you explain why it’s so great? Beyond just being aesthetically appealing (the choice of color palettes, fonts, etc.), data visualizations are tools for communicating complicated information.

Well, great graphics are similar to great jokes in this way–they should reveal a complicated ‘truth’ that was otherwise difficult to comprehend or articulate with words alone.

This scenario will show you how to make sure your graphs and figures communicate complexity effectively without misleading your audience.

step 1

  • included in step1.md?
  • grammar?
  • spelling?

Things to consider about your audience

You’ll need to determine who the audience or stakeholders will be before creating graphs or figures. You’ll likely create multiple charts throughout a data analysis that you won’t include as a final deliverable. But in most cases, your stakeholders or audience is whoever receives the final results or product.

The final graphs you produce will depend on 1) the question you’re trying to answer, and 2) the level of statistical literacy of your audience.

If your audience isn’t familiar with particular data visualizations, provide them with enough information to interpret the graph (and check their understanding). However, if you find yourself spending more time explaining a data visualization’s design than what the visualization reveals, you should consider a different graph.

Asking questions

Data visualization should be an iterative process, and getting regular feedback from your audience will help you understand their point of view. It will also help manage their expectations regarding the final deliverable.

As Hilary Mason and D.J. Patil point out in their 2015 text, Data Driven: Creating a Data Culture, asking the right questions “involves domain knowledge and expertise, coupled with a keen ability to see the problem, see the available data, and match up the two.”

The questions below can help guide your project and make sure you understand what your audience/stakeholders are expecting:

  1. What question(s) is this project trying to answer? (or What problem(s) is this project trying to solve?)
  2. Do we have access to the data to answer the question/problem posed? (or do we need to gather more data?)
  3. What is the current format/structure/location of the data? (this will have an enormous impact on the project timeline!)
  4. What context will we present the deliverable(s) in? (slide deck, website, report, etc.)
  5. How familiar will the audience be with the data in our project? (how much background information should we be providing?)

We recommend having the answers to these questions documented somewhere, as this will 1) keep your project focused and timely, and 2) ensure both you and your client/customer have a clear vision of successful completion.

You want to fully understand what you are visualizing, who the audience will be, and why they will care about the results.

step 2

  • included in step2.md?
  • grammar?
  • spelling?

Data lineage: the background on our data

It’s best to start a project off with a ‘view of the forest from outside the trees’. The technical term for this is data lineage, which

“includes the data origin, what happens to it, and where it moves over time.”

Having a ‘bird’s-eye view’ of the data ensures there weren’t any problems with exporting or importing. Data lineage also means understanding where the data are coming from–was it collected from an internal relational database, an external vendor, or did it come from the web or social media?

Knowing some of the technical details behind a dataset lets us frame the questions or problems we’re trying to tackle. In this scenario, we will use a variety of data sources, but they will all be [tabular](https://en.wikipedia.org/wiki/Table_(information)). Tabular data, like spreadsheets, organize data into columns and rows. R can handle multiple kinds of data, but that is a topic that extends beyond the scope of this scenario.

Initiate R

Let’s load some data and get started! Launch an R console by clicking here -> R (Click on the Run command icon)

Load packages

The package we’ll use to view entire datasets in R is skimr. We will install and load it (along with the tidyverse) below:

install.packages(c("tidyverse", "skimr"))
library(tidyverse)
library(skimr)

step 3

  • included in step3.md?
  • grammar?
  • spelling?

Before you start: what do we expect to see?

Before starting a new project, we want to set some expectations. The questions we covered in the previous step help us understand what kind of data we’ll be encountering. Sometimes we’ll be dealing with unknown data, but we should know approximately how many columns and rows the new dataset will contain. We might know some basic information about the variable formats, too.

For example, we should know whether we’re getting date columns (YYYY-MM-DD), logical values (TRUE, FALSE, NA), numerical measurements (integer (1L) or double (1)), or categorical data (character (male and female) or factor (low, medium, high)).
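One lightweight way to record such expectations is to sketch them as a tiny data frame and check the column classes. The example below is a sketch using base R only; the column names and values are hypothetical, chosen to match the formats described above:

```r
# Toy data frame with one column per expected format (hypothetical values)
toy <- data.frame(
  debut  = as.Date("1998-04-01"),                    # date (YYYY-MM-DD)
  active = TRUE,                                     # logical
  games  = 42L,                                      # integer
  avg    = 0.301,                                    # double
  bats   = factor("L", levels = c("L", "R", "B"))    # categorical (factor)
)

# A quick type overview: the first class of each column
sapply(toy, function(x) class(x)[1])
#> debut "Date", active "logical", games "integer", avg "numeric", bats "factor"
```

Running the same sapply() call on a freshly imported dataset is a fast way to compare what arrived against what you expected.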

Baseball data

We’re going to load a dataset to demonstrate a few ways to investigate a dataset’s quality (or how well it matches our expectations).

These data come from Sean Lahman’s Baseball Database.

Now, I am not going to assume everyone participating in this scenario is familiar with baseball. However, this exercise is arguably more rewarding if you are not a baseball fan. If you’re working with data, part of your job is to be interested in whatever you’ve been asked to analyze (even if it is only for the monetary reward).

“…if you want to work in data visualisation, you need to be relentlessly and systematically curious. You should try to get interested in anything and everything that comes your way.” - Alberto Cairo, Knight Chair in Visual Journalism, University of Miami

Analyzing and visualizing data you’re not familiar with is a chance to learn something new, and it puts you in a position to ask ‘out of the box’ questions.

Doing your homework

It’s also essential to read any accompanying documentation for new datasets. If we read the documentation on the Lahman website, we find out that People contains “Player names, DOB, and biographical info.” The variables in People are presented below:

People table

playerID = A unique code assigned to each player. The playerID links the data in this file with records in the other files
birthYear = Year player was born
birthMonth = Month player was born
birthDay = Day player was born
birthCountry = Country where player was born
birthState = State where player was born
birthCity = City where player was born
deathYear = Year player died
deathMonth = Month player died
deathDay = Day player died
deathCountry = Country where player died
deathState = State where player died
deathCity = City where player died
nameFirst = Player’s first name
nameLast = Player’s last name
nameGiven = Player’s given name (typically first and middle)
weight = Player’s weight in pounds
height = Player’s height in inches
bats = Player’s batting hand (left, right, or both)
throws = Player’s throwing hand (left or right)
debut = Date that player made first major league appearance
finalGame = Date that player made final major league appearance (blank if still active)
retroID = ID used by retrosheet
bbrefID = ID used by Baseball Reference website

Most of the data pre-processing steps center around a single question: Is this what I expected to see? Reading the documentation gives you expectations about the data to confirm or refute (and then investigate).

Now that we have some background information on this new dataset, we will look at how well People meets our expectations.

Whenever we get a new data source, we should try to view the data in its native format (if possible). We can view the raw data on the GitHub repository.

Load data

Fortunately, we can also load the raw data directly into R. We will load the People dataset using readr::read_csv(), assigning "https://bit.ly/3scsHw7" to the file argument.

People <- readr::read_csv(file = "https://bit.ly/3scsHw7")

step 4

  • included in step4.md?
  • grammar
  • spelling

Are we seeing what we expected? (1)

Before creating any visualizations, we want a display that gives us an overview of the entire People dataset. In the previous step, we went over some of the People dataset documentation, so we know what to expect.

Skimming data

We’ll be using the skimr package. skimr is designed for:

“displaying summary statistics the user can skim quickly to understand their data”

Below we pass the People dataset to the skimr::skim() function to create PeopleSkim. We then use the base::summary() function to review the new object.

If this code looks unfamiliar to you, review the Introduction to ggplot2 scenario.

PeopleSkim <- People %>%  
  skimr::skim()
summary(PeopleSkim)
Data summary
Name Piped data
Number of rows 20090
Number of columns 24
_______________________
Column type frequency:
character 14
Date 2
numeric 8
________________________
Group variables None

The output above shows a high-level summary of all the variables in the People dataset. We can see there are 20090 rows and 24 columns (14 columns are character, 2 are Date, and 8 are numeric).

Viewing character variables

The new PeopleSkim object gives us summary information to check against the documentation and help guide our data visualizations. We will start by viewing the variables according to their types in People using skimr::yank() (read the function documentation on GitHub). The skim_type argument in skimr::yank() takes a variable type ("character", "numeric", or "Date").

Run the code below to use skimr::yank() to view a skim of the character variables in the People dataset.

PeopleSkim %>% 
  skimr::yank(skim_type = "character")

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
playerID 0 1.00 5 9 0 20090 0
birthCountry 61 1.00 3 14 0 57 0
birthState 535 0.97 2 22 0 295 0
birthCity 174 0.99 3 26 0 4880 0
deathCountry 10235 0.49 3 14 0 25 0
deathState 10285 0.49 2 20 0 107 0
deathCity 10240 0.49 2 26 0 2676 0
nameFirst 37 1.00 2 14 0 2522 0
nameLast 0 1.00 2 14 0 10219 0
nameGiven 37 1.00 2 43 0 13325 0
bats 1180 0.94 1 1 0 3 0
throws 977 0.95 1 1 0 3 0
retroID 56 1.00 8 8 0 20034 0
bbrefID 2 1.00 5 9 0 20088 0

The n_missing and complete_rate columns show how much of each variable is missing (playerID and nameLast are complete, while the death* variables are missing for roughly half the players). skimr::skim() also shows us the min, max, empty, n_unique, and whitespace for the 14 character variables.

Viewing date variables

Next, we use skimr::yank() to view a skim of the Date variables in the People dataset.

PeopleSkim %>% 
  skimr::yank(skim_type = "Date")

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
debut 198 0.99 1871-05-04 2020-09-27 1965-04-14 10572
finalGame 198 0.99 1871-05-05 2020-09-27 1971-05-06 9480

The skim of the Date variables shows us which data are missing (n_missing and complete_rate), along with the earliest (min), latest (max), and middle (median).

The number of unique (n_unique) dates prints to the next line. This behavior is because the terminal window has a width limit. If the Terminal output extends past this limit, the content gets printed to the line below.
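If the wrapping makes the output hard to read, base R’s width option controls how many characters print per line before wrapping. This is a session-level setting (80 is a common default), and changing it is a sketch-level convenience rather than a required step:

```r
# Check the current console width, then widen it for this session
getOption("width")
options(width = 120)  # subsequent prints wrap at 120 characters
```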

Do these numbers make sense?

We can use these values for sanity checks. For example, the n_unique for playerID matches the total number of rows in People, which we should expect from the documentation (playerID = “A unique code assigned to each player”). The earliest dates for both debut and finalGame are in May of 1871 (which corresponds to the first MLB game ever played).
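These sanity checks can also be written as code. The sketch below wraps the ‘one row per player’ check in a small helper and demonstrates it on hypothetical IDs; the same call would work on People$playerID (the helper name and toy values are illustrative, not part of the scenario):

```r
# Helper: are all IDs unique (i.e., one row per player)?
has_unique_ids <- function(ids) length(unique(ids)) == length(ids)

# Hypothetical playerID values with one duplicate
toy_ids <- c("aardsda01", "aaronha01", "aaronha01")

has_unique_ids(toy_ids)          # FALSE: "aaronha01" appears twice
has_unique_ids(unique(toy_ids))  # TRUE after deduplicating
```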

step 5

  • included in step5.md?
  • grammar
  • spelling

Are we seeing what we expected? (2)

In the previous step, we viewed a skim() of the "character" and "Date" variables in the People dataset. We’re going to continue ‘skimming’ these data and check them against our expectations.

Viewing numeric variables

We’ll use skimr::yank() and skimr::focus() to view the n_missing and complete_rate for the "numeric" variables in People.

PeopleSkim %>% 
  focus(n_missing, complete_rate) %>% 
    yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate
birthYear 114 0.99
birthMonth 282 0.99
birthDay 424 0.98
deathYear 10232 0.49
deathMonth 10233 0.49
deathDay 10234 0.49
weight 817 0.96
height 737 0.96

The complete_rate for birthYear, birthMonth, birthDay, weight, and height is over 90%. However, deathYear, deathMonth, and deathDay are under 50%. Why do you think these data have missing values?

The skim() output for the "numeric" variables gives us a set of summary statistics:

Location statistics

  • the mean (or average) gives us the expected value for each variable
  • the median (as p50), or the ‘center’ value for each variable: half of the values are above it, and half are below.
PeopleSkim %>% 
  skimr::focus(numeric.mean, numeric.p50) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable mean p50
birthYear 1934.42 1942
birthMonth 6.63 7
birthDay 15.62 16
deathYear 1966.29 1968
deathMonth 6.49 6
deathDay 15.53 15
weight 187.80 185
height 72.34 72

Spread statistics

  • the lowest value for each variable, or minimum (as p0)
  • the highest value for each variable, or maximum (as p100)
  • Together, these two values can give us the range, which is the difference between the maximum and minimum values
PeopleSkim %>% 
  skimr::focus(numeric.p0, numeric.p100) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable p0 p100
birthYear 1820 2000
birthMonth 1 12
birthDay 1 31
deathYear 1872 2020
deathMonth 1 12
deathDay 1 31
weight 65 2125
height 43 83
  • the first quartile (as p25), which is the ‘middle’ of the data points below the median
  • the third quartile (as p75), which is the ‘middle’ of the data points above the median
  • Together, these two values can give us the interquartile range (IQR), which is the difference between the third and first quartiles
PeopleSkim %>% 
  skimr::focus(numeric.p25, numeric.p75) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable p25 p75
birthYear 1896 1973
birthMonth 4 10
birthDay 8 23
deathYear 1943 1993
deathMonth 3 10
deathDay 8 23
weight 172 200
height 71 74
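As a quick check on the arithmetic, the IQR is just p75 minus p25, and base R computes it directly. The sketch below uses a handful of hypothetical weights (in pounds) to show that quantile() by hand and the built-in IQR() helper agree:

```r
# Hypothetical weights in pounds
weights <- c(160, 172, 180, 185, 190, 200, 210)

q <- quantile(weights, probs = c(0.25, 0.75))  # first and third quartiles
unname(q[2] - q[1])  # IQR by hand: 195 - 176 = 19
IQR(weights)         # same result from the built-in helper
```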
  • the standard deviation (as sd), a measure of each variable’s dispersion.
  • The standard deviation describes how far a variable’s values are spread out around their mean
PeopleSkim %>% 
  skimr::focus(numeric.mean, numeric.sd) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable mean sd
birthYear 1934.42 42.73
birthMonth 6.63 3.47
birthDay 15.62 8.76
deathYear 1966.29 32.95
deathMonth 6.49 3.53
deathDay 15.53 8.79
weight 187.80 26.33
height 72.34 2.61

These numbers can be challenging to make sense of by themselves. Fortunately, the skimr::skim() output comes with a hist column. The hist column is a small histogram for the numeric variables.

Below we use skimr::focus() and skimr::yank() to view the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in People.

PeopleSkim %>% 
  skimr::focus(numeric.mean, numeric.sd, 
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable mean sd p0 p50 p100 hist
birthYear 1934.42 42.73 1820 1942 2000 ▁▅▅▆▇
birthMonth 6.63 3.47 1 7 12 ▇▅▅▆▇
birthDay 15.62 8.76 1 16 31 ▇▇▇▇▆
deathYear 1966.29 32.95 1872 1968 2020 ▁▃▆▇▇
deathMonth 6.49 3.53 1 6 12 ▇▅▅▅▇
deathDay 15.53 8.79 1 15 31 ▇▇▇▆▆
weight 187.80 26.33 65 185 2125 ▇▁▁▁▁
height 72.34 2.61 43 72 83 ▁▁▁▇▁

The hist column shows us a miniature distribution of the values in each numeric variable.

Do these numbers make sense?

  • As we can see, the majority of the missing values are in the variables with the death prefix (deathDay, deathMonth, and deathYear). The missing values in these variables make sense because, given the lowest birthYear value (1820), we should expect approximately half of the baseball players in the People dataset to still be alive.

  • We also notice an implausible value from the skimr output: the weight variable maximum value (2125). We can use dplyr’s filter and select functions to find the nameGiven for the abnormally high weight value.

People %>% 
  filter(weight == 2125) %>% 
  select(nameGiven, birthMonth, birthDay, birthYear, weight)
#> # A tibble: 1 x 5
#>   nameGiven    birthMonth birthDay birthYear weight
#>   <chr>             <dbl>    <dbl>     <dbl>  <dbl>
#> 1 Jacob Robert         10       28      1996   2125

Google the player’s name. What is his listed weight on Wikipedia?

step 6

  • included in step6.md?
  • grammar
  • spelling

Counting things

“Data science is mostly counting things.” - Sam Firke

Data visualizations are drawings made with numbers. The challenge is picking the best image for the numbers you want to show. Before you can choose what you want to draw, you need to decide which numbers you’d like to display.

Column/bar charts

In a bar (or column) chart, each bar/column length represents a numeric value. The number of levels in the categorical variable determines the number of bars or columns.

We will create a bar chart of the bats variable in People, which records whether the player bats left-handed (L), right-handed (R), both (B), or if these data are missing (NA). Below we’ll use dplyr’s count() function to tally the number of values for the different category items in bats.

People %>% 
  count(bats, sort = TRUE)
#> # A tibble: 4 x 2
#>   bats      n
#>   <chr> <int>
#> 1 R     12435
#> 2 L      5246
#> 3 B      1229
#> 4 <NA>   1180

In ggplot2, we create a bar chart using the geom_bar() function. First we map bats to both x and fill inside the ggplot(aes()) functions. If you need a refresher on ggplot2 layers and mapping, check out the previous scenario.

We also remove the legend with guides(fill = FALSE), and add labels for title, subtitle, caption, and y axis (x is set to NULL).

# click to execute code
gg_step6_bar_01 <- People %>% 
  ggplot(aes(x = bats, fill = bats)) + 
  geom_bar() + 
  guides(fill = FALSE) +
  labs(title = "MLB players' batting hand",
       subtitle = "Left (L), right (R), or both (B)",
       caption = "source: http://www.seanlahman.com/",
       x = NULL, y = "Number of players")
# save
# ggsave(plot = gg_step6_bar_01,
#        filename = "gg-step6-bar-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step6_bar_01

We will need to open the gg-step6-bar-01.png graph in the vscode IDE (above the Terminal console).

Counting top 10 birth countries

geom_bar() only takes a single categorical (or factor) variable. Sometimes we’ll need to specify the variable we want to map to the y axis. For example, the code below uses dplyr::filter() and dplyr::count() to return the top 10 non-US ‘Country of birth’ values for players in the People dataset. We then use utils::head() to return the top 10 rows, and dplyr::mutate() with forcats::fct_inorder() to change the format of the birthCountry variable to a factor.

People %>% 
  filter(birthCountry != "USA" & !is.na(birthCountry)) %>% 
  count(birthCountry, sort = TRUE) %>% 
  head(10) %>% 
  mutate(birthCountry = fct_inorder(birthCountry))
#> # A tibble: 10 x 2
#>    birthCountry       n
#>    <fct>          <int>
#>  1 D.R.             791
#>  2 Venezuela        425
#>  3 P.R.             270
#>  4 CAN              255
#>  5 Cuba             223
#>  6 Mexico           136
#>  7 Japan             70
#>  8 Panama            65
#>  9 United Kingdom    52
#> 10 Ireland           50

In this case, we have two variables in this output: birthCountry and ‘n’. If we’re going to build a graph from these data, we know we’ll need a way to represent both the country’s name and the number of times it occurs.

For this graph, we will need to use ggplot2’s geom_col() function (we will include the code from above, which creates a dataset with only the top 10 non-US birth countries).

We also map birthCountry to fill, remove the legend with guides(fill = FALSE), and add a label for the y axis with ggplot2::labs().

# click to execute code
gg_step6_col_01 <- People %>% 
  filter(birthCountry != "USA" & !is.na(birthCountry)) %>% 
  count(birthCountry, sort = TRUE) %>% 
  head(10) %>% 
  mutate(birthCountry = fct_inorder(birthCountry)) %>%  
  ggplot(aes(x = birthCountry, y = n, fill = birthCountry)) + 
  geom_col() +
  guides(fill = FALSE) +
  labs(title = "Top 10 Non-US birth countries for MLB players",
       subtitle = "Based on birthCountry",
       caption = "source: http://www.seanlahman.com/",
       x = NULL, y = "Number of players")
# save
# ggsave(plot = gg_step6_col_01,
#        filename = "gg-step6-col-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step6_col_01

We will need to open the gg-step6-col-01.png graph in the vscode IDE (above the Terminal console).

Flip the coordinates

If we find the x axis gets cluttered and difficult to read, we can pivot the columns’ display with the ggplot2::coord_flip() function. Note that we also need to change the forcats function in mutate() to fct_reorder().

# click to execute code
gg_step6_col_02 <- People %>% 
  filter(birthCountry != "USA" & !is.na(birthCountry)) %>% 
  count(birthCountry, sort = TRUE) %>% 
  head(10) %>% 
  # reorder the birthCountry by n
  mutate(birthCountry = fct_reorder(birthCountry, n)) %>%  
  ggplot(aes(x = birthCountry, y = n, 
             fill = birthCountry)) + 
  geom_col() +
  guides(fill = FALSE) +
  # flip coordinates
  coord_flip() +
  labs(title = "Top 10 Non-US birth countries for MLB players",
       subtitle = "Based on birthCountry",
       caption = "source: http://www.seanlahman.com/",
       x = NULL, y = "Number of players")
# save
# ggsave(plot = gg_step6_col_02,
#        filename = "gg-step6-col-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step6_col_02

We will need to open the gg-step6-col-02.png graph in the vscode IDE (above the Terminal console).

Communication tips

Bar charts and column charts display the counts for each different response in the categorical variable. We can help our audience interpret these graphs by always setting the count axis scale to zero, sorting the chart’s values to make it easier to read, and flipping the x axis if it is difficult to read.
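Putting the three tips together, a sketch like the one below (toy data; assumes ggplot2 and forcats are installed) sorts the bars with fct_reorder(), flips the coordinates, and pins the count axis at zero. The scale_y_continuous(expand = ...) call is one way to remove the default padding below zero; it is an addition here, not code from the steps above:

```r
library(ggplot2)
library(forcats)

# Toy counted data standing in for a real count() result
counts <- data.frame(group = c("A", "B", "C"), n = c(30, 5, 17))

gg_tips <- ggplot(counts, aes(x = fct_reorder(group, n), y = n)) +
  geom_col() +
  coord_flip() +                                               # readable labels
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +  # axis starts at 0
  labs(x = NULL, y = "Count")
gg_tips
```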

step 7

  • included in step7.md?
  • grammar?
  • spelling?

Single variable distributions (1)

In the previous steps, the skimr output displayed a small histogram for each numeric variable in the People dataset. Histograms are a special kind of bar/column chart–they show the distribution of numeric variables by ‘binning’ each response into a set number of bars/columns.

Load data

These data come from the TidyTuesday project, a data repository whose intent is

“to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions.”

We’re going to use a dataset of Ramen ratings from The Ramen Rater. Read more about these data here.

Below we import the raw data from an external .csv file into Ramen and get a skimr::skim() summary (stored in RamenSkim).

Ramen <- readr::read_csv("https://bit.ly/38sO0S7")
RamenSkim <- skimr::skim(Ramen)

Review data

View the character variables in RamenSkim

RamenSkim %>% 
  skimr::yank("character")

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
brand 0 1 1 34 0 456 0
variety 0 1 3 96 0 2971 0
style 2 1 3 10 0 8 0
country 0 1 2 13 0 44 0

How complete are these data?

View the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in Ramen.

RamenSkim %>% 
  skimr::focus(numeric.mean, numeric.sd, 
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist) %>% 
  skimr::yank("numeric")

Variable type: numeric

skim_variable mean sd p0 p50 p100 hist
review_number 1590.06 917.94 1 1590.00 3180 ▇▇▇▇▇
stars 3.69 1.03 0 3.75 5 ▁▁▂▇▅

Pay attention to the hist column for stars–it shows the distribution for the values. Where are most of the values concentrated?

We will investigate the distribution of stars by building a histogram with ggplot2.

Build a histogram

We’re going to use ggplot2::geom_histogram() to view the distribution of the stars variable in Ramen. Note that we are also assigning labels to the graph that include 1) a clear title, 2) descriptive information about the graph, and 3) the source of the data.

# click to execute code
gg_step7_hist_01 <- Ramen %>% 
  ggplot(aes(x = stars)) + 
  geom_histogram() + 
  labs(
       title = "Distribution of ramen stars", 
       subtitle = "bins = 30",
       caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step7_hist_01,
#        filename = "gg-step7-hist-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step7_hist_01

We will need to open the gg-step7-hist-01.png graph in the vscode IDE (above the Terminal console).

As we stated above, histograms stack the variable values into a defined set of bins. The default number for bins is 30. We can change the shape of the histogram by changing the bins argument.

Run the code below to see how the distribution looks with 20 bins. Note we also include the color = "white" argument to ensure we can see each bar separately.

# click to execute code
gg_step7_hist_02 <- Ramen %>% 
  ggplot(aes(x = stars)) + 
  geom_histogram(bins = 20, color = "white") + 
  labs(
       title = "Distribution of ramen stars", 
       subtitle = "bins = 20",
       caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step7_hist_02,
#        filename = "gg-step7-hist-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step7_hist_02

Open the gg-step7-hist-02.png graph in the vscode IDE (above the Terminal console).

The stars values fit into 20 bins better than the default 30 because we can see where values are concentrated (and the high frequency of 5-star ratings).

step 8

  • included in step8.md?
  • grammar?
  • spelling?

Single variable distributions (2)

The previous step demonstrated how to use a histogram to view the distribution of a single variable. We needed to adjust the bins in the histogram to make its shape easier to interpret. Density plots use kernel smoothing to create cleaner distributions.

Build a density plot

We’re going to use ggplot2::geom_density() to view a density plot of the stars variable in Ramen. We will use fill to color the area underneath the density line with "dodgerblue".

# click to execute code
gg_step8_density_01 <- Ramen %>% 
  ggplot(aes(x = stars)) + 
  geom_density(fill = "dodgerblue") + 
  labs(title = "Distribution of ramen stars", 
  caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step8_density_01,
#        filename = "gg-step8-density-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step8_density_01

Open the gg-step8-density-01.png graph in the vscode IDE (above the Terminal console).

Adding useful labels

Although density plots create a much smoother distribution, the y axis is harder to interpret. To overcome this, we will add two summary statistics programmatically to the labels using the base::paste0() and base::round() functions.

Run the code below to see how this works:

# click to execute code
subtitle_dens_stars <- paste0("Star rating (mean +/- sd): ", 
       # use round() to make sure there are only two decimals
       round(mean(Ramen$stars, na.rm = TRUE), 2),
       " +/- ",
       round(sd(Ramen$stars, na.rm = TRUE), 2))
subtitle_dens_stars
#> [1] "Star rating (mean +/- sd): 3.69 +/- 1.03"

We can now supply subtitle_dens_stars to the subtitle argument in labs().

Creating labels this way ensures they are updated whenever the underlying data change.

# click to execute code
gg_step8_density_02 <- Ramen %>% 
  ggplot(aes(x = stars)) + 
  geom_density(fill = "dodgerblue") + 
  labs(title = "Distribution of ramen stars", 
       # combine text with mean() and sd() for stars in Ramen
       subtitle = subtitle_dens_stars,
       # include source
       caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step8_density_02,
#        filename = "gg-step8-density-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step8_density_02

Open the gg-step8-density-02.png graph in the vscode IDE (above the Terminal console).

As we’ve said, an essential part of effective communication is knowing your audience. It’s unlikely these exploratory graphs will be part of our final deliverable, so the audience for these graphs will likely be us!

Using descriptive labels makes sure we know what we’re seeing when we’re viewing our graphs.

step 9

  • included in step9.md?
  • grammar?
  • spelling?

Multiple variable distributions (1)

We’ve looked at the distribution of all the values in the stars variable, but what if we were interested in the distribution of stars across the groups in another categorical variable, like style, which records the style of container (cup, pack, tray, etc.)?

We can check the levels of style with dplyr::count():

Ramen %>% 
  count(style, sort = TRUE)
#> # A tibble: 9 x 2
#>   style          n
#>   <chr>      <int>
#> 1 Pack        1832
#> 2 Bowl         612
#> 3 Cup          559
#> 4 Tray         138
#> 5 Box           32
#> 6 Restaurant     3
#> 7 <NA>           2
#> 8 Bar            1
#> 9 Can            1

The output above tells us the top five most common reviews for Ramen came from Packs, Bowls, Cups, Trays, and Boxes.
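As an aside, rather than typing these five level names by hand in the filter() calls that follow, we could extract them programmatically. Below is a sketch (slice_max() assumes dplyr 1.0 or later; top_styles is a name we’re introducing here):

```r
# count the styles, keep the five most common, and pull out their names
top_styles <- Ramen %>% 
  drop_na(style) %>% 
  count(style, sort = TRUE) %>% 
  slice_max(n, n = 5) %>% 
  pull(style)
# top_styles now holds "Pack", "Bowl", "Cup", "Tray", and "Box",
# and could replace the hard-coded vector in later filter() calls
```

Deriving the levels this way means the code keeps working if the rankings change when the underlying data are updated.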

Grouped skims

We can use dplyr’s filter(), select(), and group_by() functions with skimr to see the distribution of the stars variable across the five most common style levels.

# click to execute code
Ramen %>% 
  # filter to most common styles
  filter(style %in% c("Pack", "Bowl",
                      "Cup", "Tray", "Box")) %>% 
  # select only stars and style
  select(stars, style) %>% 
  # group dataset by style
  group_by(style) %>% 
  # skim grouped data
  skim() %>% 
  # focus on select output
  skimr::focus(n_missing, style,
               numeric.mean, numeric.sd, numeric.hist,
               numeric.p0, numeric.p50, numeric.p100) %>% 
  # only return numeric values
  skimr::yank("numeric") 

Variable type: numeric

skim_variable style n_missing mean sd p0 p50 p100 hist
stars Bowl 0 3.71 1.05 0 3.75 5 ▁▁▂▇▆
stars Box 0 4.21 1.29 0 5.00 5 ▁▁▁▂▇
stars Cup 0 3.49 1.04 0 3.50 5 ▁▁▂▇▃
stars Pack 14 3.74 0.99 0 3.75 5 ▁▁▂▇▆
stars Tray 0 3.60 1.14 0 3.75 5 ▁▁▂▇▆

The output shows Ramen from a Box has the highest mean stars rating. We are going to confirm this with a ridgeline plot.

The ggridges package

The mean and median (p50) in the skimr output tell us the distribution of stars varies slightly across the filtered levels of style, so we will view the density of each distribution with a ridgeline plot from the ggridges package.

Install and load ggridges below:

# click to execute code
install.packages("ggridges")
library(ggridges)

Build labels first!

We’ll build the labels for this graph first in labs_ridge_stars_style, so we know what we’re expecting to see.

# click to execute code
labs_ridge_stars_style <- labs(
       title = "Star ratings by style",  
       subtitle = "Star rating across most common ramen containers",
       caption = "source: https://www.theramenrater.com/resources-2/the-list/",
       x = "Star rating", 
       y = NULL) 

I’ve found this practice to be very helpful for conceptualizing graphs before I begin building them, which reduces errors and saves time!

Overlapping density plots

The code below uses the ggridges::geom_density_ridges() function to build overlapping density plots. In this plot, we map the fill aesthetic to the style variable. We also set guides(fill = FALSE) because we’ll have labels on the graph for each level of style.

# click to execute code
gg_step9_ridge_01 <- Ramen %>%
  # filter to most common styles
  filter(style %in% c("Pack", "Bowl",
                      "Cup", "Tray", "Box")) %>%
  ggplot(aes(x = stars,
             y = style,
             fill = style)) +
  geom_density_ridges() +
  guides(fill = FALSE) +
  # add labels
  labs_ridge_stars_style
# save
# ggsave(plot = gg_step9_ridge_01,
#        filename = "gg-step9-ridge-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step9_ridge_01

Open the gg-step9-ridge-01.png graph in the vscode IDE (above the Terminal console).

We can see that the stars ratings for the Box level in style are concentrated around 5 from the ridgeline plot.

step 10

  • included in step10.md?
  • grammar?
  • spelling?

Multiple variable distributions (2)

In the last step, we learned that the distribution of Ramen stars ratings varied across the five most common levels of style. In this step, we will view the variation of stars across style with a box-plot. Box-plots are a great way to view the summary statistics for a numeric variable (like stars) across multiple levels of a categorical variable (like style).

Box-plot labels

We’ll build the labels for the graph similar to the labels we used for the ridgeline plot, but we’ll be a little more explicit with the subtitle and x axis.

# click to execute code
labs_box_stars_style <- labs(
     title = "Star ratings by style",  
     subtitle = "Star ratings across pack, bowl, cup, tray, and box containers",
     caption = "source: https://www.theramenrater.com/resources-2/the-list/",
     x = "Ramen star ratings", 
     y = NULL) 

Building box-plots

We’ll filter the data to the five most common styles again and map stars to the x axis and style to the y axis. We will also map style to the fill aesthetic inside ggplot2::geom_boxplot().

We don’t need a guide (or legend), so we will remove it with guides(fill = FALSE).

gg_step10_boxplot_01 <- Ramen %>% 
    # filter to most common styles
  filter(style %in% c("Pack", "Bowl",
                      "Cup", "Tray", "Box")) %>%
  ggplot(aes(x = stars, y = style)) + 
  geom_boxplot(aes(fill = style)) +
  guides(fill = FALSE) + 
  labs_box_stars_style
# save
# ggsave(plot = gg_step10_boxplot_01,
#        filename = "gg-step10-boxplot-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step10_boxplot_01

Open the gg-step10-boxplot-01.png graph in the vscode IDE (above the Terminal console).

In the next step, we will cover how to interpret the contents of a box-plot.

step 11

  • included in step11.md?
  • grammar?
  • spelling?

Multiple variable distributions (3)

Box plots display many of the numbers we see in the skimr output.

Contents of a box-plot

Click on the code below to create a skimr summary of Ramen stars ratings for the Tray style.

# click to execute code
Ramen %>% 
  # filter to most common styles
  filter(style == "Tray") %>% 
  # select only stars and style
  select(stars, style) %>% 
  # group dataset by style
  group_by(style) %>% 
  # skim grouped data
  skimr::skim() %>% 
  # focus on select output
  skimr::focus(style,
               numeric.p0, numeric.p25, numeric.p50,
               numeric.p75, numeric.p100, numeric.hist) %>% 
  # only return numeric values
  skimr::yank("numeric") 

Variable type: numeric

skim_variable style p0 p25 p50 p75 p100 hist
stars Tray 0 3 3.75 4.25 5 ▁▁▂▇▆

We can calculate the interquartile range using dplyr below:

# click to execute code
Ramen %>% 
  # filter to most common styles
  filter(style == "Tray") %>% 
  # select only stars and style
  select(stars, style) %>% 
  # group dataset by style
  group_by(style) %>% 
  # summarize IQR
  summarize(
    `Stars/Tray IQR` = IQR(stars, na.rm = TRUE))
#> # A tibble: 1 x 2
#>   style `Stars/Tray IQR`
#> * <chr>            <dbl>
#> 1 Tray              1.25

The figure below shows a box-plot for the distribution of stars ratings across the Tray level of style. We’ve labeled the summary statistics from the skimr output on the graph. The 25th percentile (or p25) is the box’s first vertical line (or hinge). The 50th percentile (or p50) is the median and middle line of the box, and the 75th percentile (p75) is the last vertical line in the box (or hinge).

We’ve also labeled the minimum (p0) and maximum (p100) values and the interquartile range (the distance between p25 and p75, which, like the standard deviation, measures spread).
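The interquartile range is just the distance between the hinges, p75 minus p25, which we can verify with base R’s quantile() function:

```r
# stars ratings for the Tray style only
tray_stars <- Ramen$stars[which(Ramen$style == "Tray")]
# the 25th and 75th percentiles (the box's hinges)
hinges <- quantile(tray_stars, probs = c(0.25, 0.75), na.rm = TRUE)
# IQR = p75 - p25 = 4.25 - 3 = 1.25, matching IQR(tray_stars, na.rm = TRUE)
unname(diff(hinges))
#> [1] 1.25
```

This agrees with the dplyr::summarize() result above and with the hinges in the skimr output.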

Communication tip

If your audience is not familiar with box-plots, the figure above is an example of supporting information to include. It explains how to read the graph, using an example from the finished chart. However, complex plots add mental labor for your audience and can take attention away from the point you’re trying to make. Effective communication means always using the most straightforward (or most common) graphs to reveal your findings.

step 12

  • included in step12.md?
  • grammar?
  • spelling?

Communicating column/bar charts

Now that we’ve created a few graphs, we should stop and consider what narrative information each plot contains. We used column and bar charts to display the counts for two categorical variables:

  • bats, which measures whether MLB players are left-handed (L), right-handed (R), both (B), or if these data are missing (NA)
  • birthCountry, which tells us the ‘Country of Birth’ for each MLB player

We can use these graphs to convey comparison information. For example, we can see from the bar graph that most MLB players are right-handed batters.

Bar graph

The column chart displays similar information. The columns can be used for comparing the frequency of one country to the other. We’ve also arranged the columns in a way that makes it clear which country appears the most and which country appears the least.

Column graph

The text to accompany your graphs will largely depend on the context of the problem you’re trying to solve (or question you’re trying to answer), but there are a few general guidelines we can apply to each type of graph.

Communication (labels)

Titles should be objective and neutral, expressing the “who,” “what,” and “where” of the figure’s measurements. Avoid jargon and unnecessary descriptive words. Stick with 1) what data was measured, 2) when the data was measured, and 3) how the data was counted (i.e., the units).

When you are building labels, plan on providing enough information that the chart becomes a ‘stand-alone product.’ By this, we mean that if a new observer viewed your graph, they would at minimum be able to understand what point the figure was trying to make (i.e., “this graph shows the values in X variable,” or “this figure shows the relationship between X and Y”).

Communication (distributions)

We’ve explored the distribution of the stars rating variable in the Ramen dataset on its own in the histogram and density plot, and across the top five container styles in the ridgeline plot and box-plot.

Histogram

Density graph

Ridgeline plot

Box-plot

Distribution plots are useful if we’re answering exploratory questions about a variable before calculating statistics or building a model. Histograms, density, and ridgeline plots can quickly tell us if a variable has a normal (bell-shaped) distribution, which is a crucial assumption to check before modeling. Box-plots require a higher degree of statistical literacy for interpretation, so we recommend confirming your audience is familiar with these graphs before relying on them. Summary statistics are also vital to include with distribution graphs (usually in a supplementary table) because it tethers the figure to mathematical values.
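Beyond eyeballing a histogram, a normal quantile-quantile plot is a quick base R check for that normality assumption. A sketch using the stars variable (points tracking the straight line suggest an approximately normal distribution):

```r
# plot the sample quantiles of stars against theoretical normal quantiles
qqnorm(Ramen$stars, main = "Normal Q-Q plot of ramen stars")
# add a reference line through the first and third quartiles
qqline(Ramen$stars)
```

A plot like this is usually for our own exploration, not a final deliverable, for the same audience reasons discussed above.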

The information from these exploratory charts gives your narrative context and frames the problem. If we were telling a story, this would be the portion that tells us the setting or universe in which our characters live.

step 13

  • included in step13.md?
  • grammar?
  • spelling?

Visualizing relationships (1)

Now that we know how to explore variable distributions, we will look at relationships between two (or more) variables. We must establish what kind of relationship we’re investigating before deciding which plot to make. We’ve already been exploring relationships between two variables: we looked at the distribution (or spread) of stars vs. five categories of style in the Ramen dataset, and we also looked at birthCountry vs. n (the count of each birth country) in the People dataset.

In this step, we will look at how two numeric variables change (or vary) and if that change is in the same (or opposite) direction. One of the most common graphs for visualizing this relationship between two numerical variables is the scatterplot. A scatterplot uses points to display two numeric variables, with one variable on each axis.

Star Wars data

We will load the starwars data from the dplyr package. These data come from the Star Wars API. Read more about this dataset on dplyr’s website.

# click to execute code
StarWars <- dplyr::starwars 

We will look at the relationship between height and mass for characters in the StarWars dataset. Let’s start by looking at the n_missing and complete_rate for these two variables.

# click to execute code
StarWars %>% 
  select(height, mass) %>% 
  skimr::skim() %>% 
  skimr::focus(n_missing, complete_rate) %>% 
  skimr::yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate
height 6 0.93
mass 28 0.68

We can see that almost a third of the data in mass are missing (the complete_rate is only 0.68). This amount of missing data might surprise us if we hadn’t explored it before plotting.
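skimr’s complete_rate is just the proportion of non-missing values, which we can verify directly:

```r
# proportion of non-missing mass values (skimr's complete_rate)
mean(!is.na(StarWars$mass))
#> [1] 0.6781609
# and the raw count of missing values (skimr's n_missing)
sum(is.na(StarWars$mass))
#> [1] 28
```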

We will also view the mean, standard deviation (sd), minimum (p0), 25th percentile (p25), median (p50), 75th percentile (p75), maximum (p100), and hist for these two columns.

# click to execute code
StarWars %>% 
  select(height, mass) %>% 
  skimr::skim() %>% 
    skimr::focus(numeric.mean, numeric.sd, 
                 numeric.p0, numeric.p25, 
                 numeric.p50, numeric.p75, 
                 numeric.p100, numeric.hist) %>% 
  skimr::yank("numeric") 

Variable type: numeric

skim_variable mean sd p0 p25 p50 p75 p100 hist
height 174.36 34.77 66 167.0 180 191.0 264 ▁▁▇▅▁
mass 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁

We can see at least one value of mass that is considerably higher than the rest. We can tell because the location statistics are similar to each other (mean = 97.3, median (p50) = 79), but the spread (sd = 169) is almost twice the value of the location. The maximum value (p100) of 1358 also confirms this finding. What is going on with this value?

Investigate outliers

It’s always a good idea to investigate values that seem implausible (like we did with the abnormally high weight for the MLB player in step 5). If we can’t figure out what is going on, we should communicate this with our stakeholders. Outliers can have a big impact on data visualizations (and statistical models), so ensuring we account for them is essential for communicating with our audience.

We’re going to filter the StarWars data to only observations with mass greater than 200, and select only the name, height, mass, sex, and species columns (we chose 200 because it’s approximately 2x the p75 value).

StarWars %>% 
    filter(mass > 200) %>% 
    select(name, height, mass, sex, species)
#> # A tibble: 1 x 5
#>   name                  height  mass sex            species
#>   <chr>                  <int> <dbl> <chr>          <chr>  
#> 1 Jabba Desilijic Tiure    175  1358 hermaphroditic Hutt

We can now see this mass belongs to Jabba the Hutt, which makes sense if we do some additional research.

Labels

Now that we know what we’re going to visualize, we can make our labels.

# click to execute code
labs_scatter_ht_mass_01 <- labs(
  title = "Star Wars Character's height and mass", 
  x = "Mass", 
  y = "Height")

Scatterplots

We will create a scatterplot with ggplot2::geom_point() by mapping mass to the x axis and height to the y axis.

# click to execute code
gg_step13_scatter_01 <- StarWars %>% 
  ggplot(aes(x = mass, y = height)) + 
  geom_point() + 
  # add labels
  labs_scatter_ht_mass_01
# save
# ggsave(plot = gg_step13_scatter_01,
#        filename = "gg-step13-scatter-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step13_scatter_01

Open the gg-step13-scatter-01.png graph in the vscode IDE (above the Terminal console). Notice how the points are all clustered on the left-hand side of the chart? The x axis has extended to account for Jabba the Hutt’s high mass value, which has made it hard to interpret the relationship among the other values.

What happens when we remove the outliers?

If you encounter outliers, it’s a good idea to view each graph with and without them to see how much they influence the plot. Let’s filter Jabba the Hutt’s mass out of the StarWars dataset and rebuild the scatterplot. We’ll also add this information to a new set of labels, so we don’t get the two graphs confused.

# click to execute code

# new labels
labs_scatter_ht_mass_02 <- labs(
  title = "Star Wars Character's height and mass", 
  subtitle = "Characters with mass less than 200",
  x = "Mass", 
  y = "Height")

# build graph
gg_step13_scatter_02 <- StarWars %>% 
  filter(mass < 200) %>% 
  ggplot(aes(x = mass, y = height)) + 
  geom_point() + 
  # add labels
  labs_scatter_ht_mass_02
# save
# ggsave(plot = gg_step13_scatter_02,
#        filename = "gg-step13-scatter-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step13_scatter_02

Open the gg-step13-scatter-02.png graph in the vscode IDE (above the Terminal console).

Based on the scatterplot, we can see a positive relationship between mass and height for Star Wars characters. But is this the same for all types of characters? For example, does this relationship hold for all levels of gender?

step 14

  • included in step14.md?
  • grammar?
  • spelling?

Visualizing relationships (2)

We will continue looking at the relationship between mass and height in the StarWars dataset. We looked at mass and height with and without an outlier’s influence in the previous step. In this step, we will add a categorical variable (gender) to the plot to see if the direction of the change for mass and height is the same for all levels of gender.

Counting with tabyls

Let’s view the count of gender below using the tabyl() function from the janitor package.

# click to execute code
install.packages("janitor")
library(janitor)

janitor::tabyl() works similarly to dplyr::count(), but automatically prints a bit more information in the output. Click on the code block below to create a tabyl for the gender variable.

# click to execute code
StarWars %>% 
  janitor::tabyl(gender) 
#>     gender  n    percent valid_percent
#>   feminine 17 0.19540230     0.2048193
#>  masculine 66 0.75862069     0.7951807
#>       <NA>  4 0.04597701            NA

We can see the standard output produces percent and valid_percent columns. We can also add percent formatting with janitor::adorn_pct_formatting():

# click to execute code
StarWars %>% 
  janitor::tabyl(gender) %>% 
  janitor::adorn_pct_formatting()
#>     gender  n percent valid_percent
#>   feminine 17   19.5%         20.5%
#>  masculine 66   75.9%         79.5%
#>       <NA>  4    4.6%             -

This output tells us 4 characters in the StarWars dataset will not show up if we use the gender variable. Read more about the tabyl function options here.
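For comparison, we could approximate this table with dplyr alone; tabyl() saves us the bookkeeping, especially for valid_percent (the percentage among non-missing values). A sketch:

```r
# rebuild tabyl()-style columns with count() and mutate()
StarWars %>% 
  count(gender) %>% 
  mutate(percent = n / sum(n),
         # valid_percent excludes the missing values from the denominator
         valid_percent = ifelse(is.na(gender), NA,
                                n / sum(n[!is.na(gender)])))
```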

Scatterplot (3 variables)

One way to include the gender variable in the scatterplot is to map it to the color aesthetic. The output from tabyl tells us there are 4 values in gender that will be missing from this plot.

We will update our labels and add gender to the scatterplot in the code below.

# click to execute code
labs_scatter_ht_mass_gender <- labs(
  title = "Star Wars Character's gender, height and mass", 
  subtitle = "Data for gender (feminine/masculine), height, and mass < 200",
  x = "Mass", 
  color = "Gender",
  y = "Height")

gg_step14_scatter_01 <- StarWars %>% 
  filter(!is.na(gender) & mass < 200) %>% 
  ggplot(aes(x = mass, y = height, color = gender)) + 
  geom_point() +
  # add labels
  labs_scatter_ht_mass_gender
# save
# ggsave(plot = gg_step14_scatter_01,
#        filename = "gg-step14-scatter-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step14_scatter_01

Open the gg-step14-scatter-01.png graph in the vscode IDE (above the Terminal console).

The color of the points shows that the feminine characters occupy a smaller range of values for the relationship between mass and height.

step 15

  • included in step15.md?
  • grammar?
  • spelling?

Visualizing relationships (3)

Sometimes we will want to look at how a particular measurement changes (or trends) over time. When we’re visualizing trends, the x axis typically contains the measure of time, and the y axis contains our measurement of interest. Each time point along the x axis has a corresponding value on the y axis, and lines connect these points. These lines are extended along the x axis’s full scale to display the change over time (or the trend). See the example from the FiveThirtyEight article titled, “Comic Books Are Still Made By Men, For Men And About Men”:

We’re going to re-create this chart using data from the fivethirtyeightdata package.

Comic Book Data

The fivethirtyeightdata package contains data from the FiveThirtyEight Github repository, but these data have been formatted to provide “tame data principles for introductory statistics and data science courses.”

We are going to load the comic_characters dataset from the article above. We’re only interested in a subset of this dataset, so we select the relevant variables and do some initial formatting before assigning the result to ComicData (read more about the data here).

# click to execute code
ComicData <- read_csv("https://bit.ly/3oS1zQY") 
# subset data
ComicData <- ComicData %>% 
  # select only publisher, name, sex, year, and date
  select(publisher, name, sex, year, date) %>% 
  # filter to only the rows containing either male or female characters
  filter(sex %in% c("Female Characters", "Male Characters")) %>% 
  # convert these two variables to factors
  mutate(sex = factor(sex, 
                      levels = c("Female Characters", 
                                 "Male Characters")),
         publisher = factor(publisher, 
                            levels = c("Marvel", "DC"))) %>% 
  # remove all missing values
  drop_na()
# view
glimpse(ComicData)
#> Rows: 21,408
#> Columns: 5
#> $ publisher <fct> DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC,…
#> $ name      <chr> "Batman (Bruce Wayne)", "Superman (Clark Kent)", "Green Lan…
#> $ sex       <fct> Male Characters, Male Characters, Male Characters, Male Cha…
#> $ year      <dbl> 1939, 1986, 1959, 1987, 1940, 1941, 1941, 1989, 1969, 1956,…
#> $ date      <date> 1939-05-01, 1986-10-01, 1959-10-01, 1987-02-01, 1940-04-01…

We formatted two variables in ComicData as factors. We will use skimr::skim() to get an overview of publisher and sex.

Factors

Factor variables are unique kinds of qualitative or categorical variables in R because they have a “fixed and known set of possible values.” We assigned these values with the levels argument.
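A quick base R sketch of what this means in practice (the levels fix the set and order of possible values, and ggplot2 orders axes and legends by the levels, not alphabetically):

```r
# the levels argument fixes the set (and order) of possible values
pub <- factor(c("DC", "Marvel", "DC"), levels = c("Marvel", "DC"))
levels(pub)
#> [1] "Marvel" "DC"
# a value outside the declared levels is converted to NA
factor("Image", levels = c("Marvel", "DC"))
#> [1] <NA>
#> Levels: Marvel DC
```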

The skimr output below shows us the two new factor variables we’ve created in ComicData.

# click to execute code
ComicData %>% 
  skimr::skim() %>% 
  skimr::yank("factor") 

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
publisher 0 1 FALSE 2 Mar: 14728, DC: 6680
sex 0 1 FALSE 2 Mal: 15834, Fem: 5574

The top_counts column tells us the counts for both levels in publisher and sex.

Summarizing data

To recreate the graph above, we’ll need to summarize the ComicData data. We need year represented on the x axis, and the percentage of female comic book characters for each publisher represented on the y axis. We can do this with dplyr’s group_by(), summarize(), mutate(), and ungroup() functions.

The code below creates two new variables:

  • sex_n_per_yr_pub, which is the count of each level of sex per year and publisher, and
  • sex_pct_per_yr_pub, which is the percentage of each level of sex per year and publisher

We also filter the year variable to only data between 1980 and 2010.

# click to execute code
ComicNewFemalePerc <- ComicData %>% 
  group_by(year, publisher, sex) %>% 
  summarize(sex_n_per_yr_pub = sum(n())) %>% 
  group_by(year, publisher) %>% 
  mutate(sex_pct_per_yr_pub = sex_n_per_yr_pub / sum(sex_n_per_yr_pub),
         sex_pct_per_yr_pub = round(sex_pct_per_yr_pub, digits = 3)) %>% 
  ungroup() %>% 
  filter(year > 1979 & year < 2011)
head(ComicNewFemalePerc)
#> # A tibble: 6 x 5
#>    year publisher sex               sex_n_per_yr_pub sex_pct_per_yr_pub
#>   <dbl> <fct>     <fct>                        <int>              <dbl>
#> 1  1980 Marvel    Female Characters               60              0.243
#> 2  1980 Marvel    Male Characters                187              0.757
#> 3  1980 DC        Female Characters               10              0.278
#> 4  1980 DC        Male Characters                 26              0.722
#> 5  1981 Marvel    Female Characters               58              0.271
#> 6  1981 Marvel    Male Characters                156              0.729

Labels

We will build labels identical to those in the FiveThirtyEight graph but include the original article’s URL as a caption.

# click to execute code
labs_line_comics <- labs(
  title = "Comics Aren't Gaining Many Female Characters", 
  subtitle = "Percentage of new characters who are female", 
  caption = "https://fivethirtyeight.com/features/women-in-comic-books/",
  x = NULL, 
  y = NULL)

Line graph

Now that we have our data and labels, we can build the line graph. First, we need to filter the data to only female percentages, then pass the filtered data to ggplot(aes()), mapping the year to the x axis and sex_pct_per_yr_pub to the y axis.

On the next layer, inside the ggplot2::geom_line() function, we map publisher to the group and color aesthetics inside the aes() function. Outside the aes() function, we make the lines larger by setting the size to 2.

We need to make a few more customizations to this graph to make it look like the one in the article:

  • Note the FiveThirtyEight graph does not have a legend or guide. We can remove the legend by adding a ggplot2::theme(legend.position = "none") layer.

  • The y axis in the FiveThirtyEight graph ranges from 0 to 50 and is formatted as a percent (%). Displaying percentage units helps the audience understand that a proportion is displayed (not the raw counts). We can change the formatting on the y axis with the ggplot2::scale_y_continuous() function. Set the limits argument to c(0.00, 0.50), and the labels argument to scales::percent_format(accuracy = 1).

# click to execute code
gg_step15_line_01 <- ComicNewFemalePerc %>% 
  filter(sex == "Female Characters") %>% 
  ggplot(aes(x = year, y = sex_pct_per_yr_pub)) + 
  geom_line(aes(group = publisher, color = publisher),
            size = 2) + 
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(0.00, 0.50),
              labels = scales::percent_format(accuracy = 1)) +
  # add labels
  labs_line_comics
# save
# ggsave(plot = gg_step15_line_01,
#        filename = "gg-step15-line-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step15_line_01

Open the gg-step15-line-01.png graph in the vscode IDE (above the Terminal console).

Our graph is starting to look like the figure in the article, but we still need to make a few changes. We removed the legend, so we have no way of knowing which line belongs to which publisher (DC or Marvel). In the next section, we will learn how to add these labels onto the graph near their respective lines.

step 16

  • included in step16.md?
  • grammar?
  • spelling?

Adding text annotations

In the previous step, we started recreating the following graph from the FiveThirtyEight article titled, “Comic Books Are Still Made By Men, For Men And About Men”.

We left off with the line graph in the gg-step15-line-01.png file. The FiveThirtyEight graph designers chose to remove the legend and add text labels directly to the chart, helping communicate the trend to audiences. This practice isn’t always the best choice, but it works in this graph for a few reasons, which we will cover below.

When to add text to graphs

Text on graphs is essential to communicating the graph’s contents. After all, a figure without text is just a drawing. We’ve already covered what information to include in the graph labels (i.e., the title, subtitle, caption, and x and y axes). Annotations are additional text we place in the plot area to help the audience understand what we’re trying to show.

We recommend adding text annotations as long as they don’t obscure the data or distract the audience from the point you’re trying to make.

In the FiveThirtyEight graph, the text annotations work because there are only two levels for the measure of interest (DC and Marvel). Encoding this information in color aesthetics is helpful because we can set them to contrasting hues (i.e., blue and red). If there were more publisher levels, the color aesthetic might not be the best choice.

Lastly, the text annotations highlight the graph title’s information (“Comics Aren’t Gaining Many Female Characters”). Both annotations are in the same general area on the x axis, and their placement on the y axis is far enough from the line that the text doesn’t overlap.

Building text annotations

To get our graph closer to the FiveThirtyEight figure, we’re going to set the line colors with ggplot2::scale_color_manual(). This function takes a values argument, which we will fill with two colors, c("firebrick3", "dodgerblue").

In ggplot2, we can add text annotations with the annotate() function. Each level of publisher gets a layer and requires the following arguments:

  • geom = "text" this lets annotate() know we want text
  • x = this is the position on the x axis we want the annotation
  • y = this is the position on the y axis we want the annotation
  • label = this is the text we want to use for the annotation
  • size = 7 the size of the text
  • color = the same values we supplied to scale_color_manual()
# click to execute code
gg_step16_line_01 <- ComicNewFemalePerc %>% 
  filter(sex == "Female Characters") %>% 
  ggplot(aes(x = year, y = sex_pct_per_yr_pub)) + 
  geom_line(aes(group = publisher, color = publisher),
            size = 2) + 
  scale_y_continuous(limits = c(0.00, 0.50),
           labels = scales::percent_format(accuracy = 1)) + 
  # color the axis
  scale_color_manual(values = c("firebrick3", 
                                "dodgerblue")) +
  # add text annotations
  annotate(geom = "text", x = 2001, y = .47, 
           label = "DC", size = 7, 
           color = "dodgerblue") +
  annotate(geom = "text", x = 2002, y = .25, 
           label = "Marvel", size = 7, 
           color = "firebrick3") + 
  # add labels
  labs_line_comics
# save
# ggsave(plot = gg_step16_line_01,
#        filename = "gg-step16-line-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step16_line_01

Open the gg-step16-line-01.png graph in the vscode IDE (above the Terminal console).

We’re almost there! We will make a few final adjustments to this graph to change the fonts displayed on the title and x and y axes.

Adjusting font attributes

The title font in the FiveThirtyEight graph is bold, and the subtitle text size is approximately the same size as the text on the x and y axes. We can adjust the font attributes in a ggplot2 graph with the theme() function. We’ve already used the theme() function in the previous step to remove the legend (legend.position = "none").

The ggplot2::theme() function comes with many options for customizing the way our graph looks. We will only cover a small number of these; you can read more about customizing ggplot2 graphs in the excellent R Graphics Cookbook, 2nd Edition by Winston Chang.

To change the font text attributes, we will set the following arguments in theme() (note that each argument also takes the ggplot2::element_text() function):

  • Change the plot.title text with element_text(size = 18, face = "bold")
  • Change the plot.subtitle text element_text(size = 15)
  • Change the axis.text.y and axis.text.x text with element_text(size = 14)
# click to execute code
gg_step16_line_02 <- ComicNewFemalePerc %>% 
  filter(sex == "Female Characters") %>% 
  ggplot(aes(x = year, y = sex_pct_per_yr_pub)) + 
  geom_line(aes(group = publisher, color = publisher),
            size = 2) + 
  scale_y_continuous(limits = c(0.00, 0.50),
           labels = scales::percent_format(accuracy = 1)) + 
  # adjust text on labels and axes
  theme(legend.position = "none", 
        plot.title = element_text(size = 18, 
                                  face = "bold"),
        plot.subtitle = element_text(size = 15),
        axis.text.y = element_text(size = 14),
        axis.text.x = element_text(size = 14)) +
  # color the axis
  scale_color_manual(values = c("firebrick3", 
                                "dodgerblue")) +
  # add text annotations
  annotate(geom = "text", x = 2001, y = .47, 
           label = "DC", size = 7, 
           color = "dodgerblue") +
  annotate(geom = "text", x = 2002, y = .25, 
           label = "Marvel", size = 7, 
           color = "firebrick3") + 
  # add labels
  labs_line_comics
# save
# ggsave(plot = gg_step16_line_02,
#        filename = "gg-step16-line-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step16_line_02

Open the gg-step16-line-02.png graph in the vscode IDE (above the Terminal console).

Now our plot looks very close to the original graph in the FiveThirtyEight article. If you like the look of this graph, be sure to check out the ggthemes package for theme_fivethirtyeight() and other themes.

step 17

  • included in step17.md?
  • grammar?
  • spelling?

Multiple graph displays

Up to this point, the data we’ve presented resulted in a single visualization. Sometimes data are so complex and layered that they will require multiple graphs. In this case, we will want to arrange numerous figures together in a grid or facet. Charts presented this way are called ‘Small Multiples’, a term popularized by Edward Tufte.

“Small multiples resemble the frames of a movie: a series of graphics, showing the same combination of variables, indexed by changes in another variable.” Edward Tufte, The Visual Display of Quantitative Information.

We are going to demonstrate faceting with a dataset we’ve created from the internet movie database (IMDB).

IMDB data

IMDB makes multiple datasets available for download. We’ve combined the title.ratings.tsv, name.basics.tsv, and title.principals.tsv datasets into the ImdbData dataset with the following columns.

  • tconst - alphanumeric unique identifier of the title (used for joining)
  • nconst - alphanumeric unique identifier of the name/person (used for joining)
  • category - the category of job that person was in
  • primaryName - name by which the person is most often credited
  • birthYear - in YYYY format
  • averageRating - weighted average of all the individual user ratings
  • numVotes - number of votes the title has received
  • primaryTitle - the more popular title / the title used by the filmmakers on promotional materials at the point of release
  • originalTitle - original title, in the original language
  • isAdult - non-adult title or adult title
  • startYear - represents the release year of a title (for a TV series, the series start year)
  • runtimeMinutes - primary runtime of the title, in minutes
  • genres - includes up to three genres associated with the title
  • age_lead - age of the actor/actress at the time of the film (age_lead = startYear - birthYear)
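The derived age_lead column follows directly from the formula above. Here is a minimal base-R sketch of that derivation on a toy data frame (the values are illustrative, not the real ImdbData pipeline):

```r
# toy data frame with the two columns used to derive age_lead
films <- data.frame(
  startYear = c(1906, 1908, 1910),
  birthYear = c(1879, 1889, 1841)
)

# age of the lead actor/actress at the time of the film
films$age_lead <- films$startYear - films$birthYear
films$age_lead
#> [1] 27 19 69
```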

# click to execute code
ImdbData <- readr::read_csv("https://bit.ly/3qr6rg6")
glimpse(ImdbData)
#> Rows: 136,925
#> Columns: 14
#> $ tconst         <chr> "tt0000574", "tt0000630", "tt0000886", "tt0001101", "t…
#> $ nconst         <chr> "nm0846887", "nm0624446", "nm0609814", "nm0923594", "n…
#> $ category       <chr> "actress", "actress", "actor", "actor", "actor", "actr…
#> $ primaryName    <chr> "Elizabeth Tait", "Fernanda Negri Pouget", "Jean Moune…
#> $ birthYear      <dbl> 1879, 1889, 1841, 1870, 1869, 1844, 1891, 1879, 1867, …
#> $ averageRating  <dbl> 6.1, 3.2, 5.0, 5.2, 4.7, 5.8, 5.5, 3.9, 4.6, 5.3, 4.2,…
#> $ numVotes       <dbl> 609, 11, 23, 13, 10, 22, 47, 11, 13, 28, 14, 13, 25, 1…
#> $ primaryTitle   <chr> "The Story of the Kelly Gang", "Hamlet", "Hamlet, Prin…
#> $ originalTitle  <chr> "The Story of the Kelly Gang", "Amleto", "Hamlet", "Ab…
#> $ isAdult        <chr> "non-adult title", "non-adult title", "non-adult title…
#> $ startYear      <dbl> 1906, 1908, 1910, 1910, 1910, 1912, 1910, 1911, 1910, …
#> $ runtimeMinutes <dbl> 70, NA, NA, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, NA…
#> $ genres         <chr> "Biography,Crime,Drama", "Drama", "Drama", "\\N", "Cri…
#> $ age_lead       <dbl> 27, 19, 69, 40, 41, 68, 19, 32, 43, 28, 52, 39, 23, 41…

Below we build a skimr::skim() summary of the ImdbData dataset.

# click to execute code
SkimImdbData <- skimr::skim(ImdbData)
summary(SkimImdbData)
Data summary
Name ImdbData
Number of rows 136925
Number of columns 14
_______________________
Column type frequency:
character 8
numeric 6
________________________
Group variables None

First, we will review the character variables.

# click to execute code
SkimImdbData %>% 
  skimr::yank("character")

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
tconst 0 1 9 10 0 136925 0
nconst 0 1 9 10 0 41765 0
category 0 1 5 7 0 2 0
primaryName 0 1 2 38 0 41658 0
primaryTitle 0 1 1 196 0 124228 0
originalTitle 0 1 1 196 0 128171 0
isAdult 0 1 11 15 0 2 0
genres 0 1 2 31 0 1056 0

These all look complete.

Do these numbers make sense?

The number of individual responses (n_unique) for each character variable is a good source for sanity checks. For example, the largest number of unique values belongs to the title id variable (tconst = 136925), and this is identical to the number of rows in the dataset. The next largest number belongs to the originalTitle variable (128171), and the documentation tells us this variable is the title for the film in its original language. By itself, this number doesn’t tell us much, but we can see the next largest number (124228) is the film’s primaryTitle, and it makes sense that the number of unique responses for these two variables is almost the same.

It also makes sense that the n_unique for actor/actress (nconst) is close to the actor/actress primaryName. There should be way more titles (originalTitle or primaryTitle) than genres, and there are (1056).

Finally, we can see the two binary variables we read about above (category and isAdult) only list 2 unique values (in n_unique), so it appears we imported these variables correctly.
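The n_unique figures behind these sanity checks are simply counts of distinct values, so they are easy to reproduce by hand. A hedged base-R sketch (dplyr::n_distinct() does the same job; the toy vector below is made up):

```r
# n_unique in the skim output is just the count of distinct values
n_unique <- function(x) length(unique(x))

# toy vector standing in for a two-level column like category
category <- c("actor", "actress", "actor", "actor", "actress")
n_unique(category)
#> [1] 2
```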

Next, we will review the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in ImdbData.

# click to execute code
SkimImdbData %>% 
  skimr::focus(numeric.mean, numeric.sd, 
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist) %>% 
  skimr::yank("numeric")

Variable type: numeric

skim_variable mean sd p0 p50 p100 hist
birthYear 1946.85 27.79 1839 1950.0 2015 ▁▂▆▇▂
averageRating 6.00 1.16 1 6.1 10 ▁▂▇▆▁
numVotes 6015.29 43514.99 10 125.0 2334927 ▇▁▁▁▁
startYear 1985.28 26.46 1906 1990.0 2021 ▁▂▃▅▇
runtimeMinutes 97.14 24.13 2 94.0 1500 ▇▁▁▁▁
age_lead 38.43 12.95 1 36.0 98 ▁▇▅▁▁

Do these numbers make sense?

  • The average birthYear is 1947, which is plausible considering the date range for movies in the IMDB (1906 - 2021).

  • The average movie rating is 6.00, just above the midpoint of IMDB’s 1-10 scale. We can feel confident these data aren’t badly skewed because the mean and median (p50) are relatively close to each other.

  • The number of votes (numVotes) is the most skewed variable: it ranges from 10 to 2334927, but the median is only 125.

  • The startYear for the movies has an average of 1985, and the histogram increases steadily from 1906 to 2021, which makes sense because more films are made every year.

  • The average length of each movie in ImdbData is 97.1 minutes (runtimeMinutes). But we can also see from the hist that the range for runtimeMinutes includes some very low and high values (p0 = 2 and p100 = 1500).

The actor/actress’s average age is 38.4, with a low of 1 and a high of 98 (both plausible).
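The ‘mean close to median’ heuristic used above can be made concrete. A minimal base-R sketch on simulated data (not the real ImdbData; the distributions are chosen to mimic the shapes described above):

```r
set.seed(1)

# a strongly right-skewed variable, similar in spirit to numVotes:
# the mean is dragged far above the median by the long tail
votes <- rlnorm(1000, meanlog = 5, sdlog = 2)
mean(votes) > median(votes)
#> [1] TRUE

# a roughly symmetric variable, similar in spirit to averageRating:
# mean and median land close together
ratings <- rnorm(1000, mean = 6, sd = 1.16)
abs(mean(ratings) - median(ratings)) < 0.2
#> [1] TRUE
```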

Creating categories from a numerical variable

We will proceed under the assumption that our stakeholders asked us to help explain the relationship between the average rating a movie received (averageRating) and the number of votes that went into the score (numVotes).

There are quite a few years in this dataset, so we will bin the films into decades instead. To do this, we create a categorical variable from the startYear variable. The cut() function is handy because we can supply the number of breaks to split the numeric startYear variable into (12 in this case). Note that cut() creates equal-width bins, so these bins only approximate calendar decades. We will also create clear labels for this categorical variable with the labels argument and make the resulting factor ordered.

We check our new factor variable with the fct_count() function from the forcats package.

# click to execute code
ImdbData <- ImdbData %>% 
  mutate(year_cat10 = cut(x = startYear, 
                          breaks = 12, 
                          labels = c("1910s", "1920s", "1930s", 
                                     "1940s", "1950s", "1960s", 
                                     "1970s", "1980s", "1990s", 
                                     "2000s",  "2010s", "2020s"),
                          ordered = TRUE))
# check the count of our factor levels 
fct_count(f = ImdbData$year_cat10, sort = TRUE)
#> # A tibble: 12 x 2
#>    f         n
#>    <fct> <int>
#>  1 2020s 25898
#>  2 2010s 24430
#>  3 1990s 15196
#>  4 2000s 14914
#>  5 1970s 13253
#>  6 1980s 12466
#>  7 1960s 10204
#>  8 1940s  7714
#>  9 1950s  6809
#> 10 1930s  4102
#> 11 1920s  1615
#> 12 1910s   324

step 18

  • included in step18.md?
  • grammar?
  • spelling?

Number of votes

We want to examine how the numVotes variable changed over time (year_cat10). We want each decade on the x axis and the numVotes for each film on the y axis. Let’s review the numVotes variable below with skimr::skim().

ImdbData$numVotes %>% skimr::skim()
Data summary
Name Piped data
Number of rows 136925
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 6015.29 43514.99 10 33 125 666 2334927 ▇▁▁▁▁

We can see the values for this variable are concentrated, or skewed, towards 0. However, we know enough information to build our labels.

labs_numvotes_yearcat10 <- labs(
  title = "Number of IMDB review votes by decade", 
  subtitle = "Internet Movie Database (IMDB)",
  caption = "https://www.imdb.com/", 
  x = "Decade", 
  y = "Number of votes")

Scatterplots with geom_jitter

We’re going to view the distribution of numVotes across each decade in year_cat10 with ggplot2::geom_jitter(). The geom_jitter() geom is similar to geom_point(), except it adds “a small amount of random variation to the location of each point.” We will set the following arguments inside geom_jitter() to help demonstrate how it works.

  • size = 0.9 controls how small or large the points will be
  • width = 0.25 determines how far the points ‘jitter’ horizontally (the default is 0.40, so we’re decreasing it slightly)
  • alpha = 1/6 controls the saturation (or transparency) of the points
  • show.legend = FALSE removes the legend (the decades are already labeled across the x axis)
# click to execute code
gg_step18_jitter_01 <- ImdbData %>%
  ggplot(aes(x = year_cat10,
             y = numVotes,
             fill = year_cat10)) +
  geom_jitter(size = 0.9, 
            width = 0.25,
            alpha = 1/6, 
            show.legend = FALSE) + 
  guides(fill = FALSE) +
  # add labels
  labs_numvotes_yearcat10
# save
# ggsave(plot = gg_step18_jitter_01,
#        filename = "gg-step18-jitter-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step18_jitter_01

Open the gg-step18-jitter-01.png graph in the vscode IDE (above the Terminal console). Notice how there are a handful of extreme points above 1,500,000?

Data with extreme values like this can be removed or transformed. If we have a reason to exclude these data, we can filter the data to only the range we want in the graph (say, Number of votes < 1,500,000) and then build the plot.

Run the code below to see what this would look like:

# click to execute code
labs_jitter_numvote_yearcat10_02 <- labs(
  title = "Number of IMDB review votes by decade", 
  subtitle = "Internet Movie Database (IMDB)",
  caption = "*Number of votes < 1500000; https://www.imdb.com", 
  x = "Decade", 
  y = "Number of votes")

gg_step18_jitter_02 <- ImdbData %>% 
  filter(numVotes < 1500000) %>% 
  ggplot(aes(x = year_cat10, 
             y = numVotes)) + 
  geom_jitter(size = 0.9, 
              width = 0.25,
              alpha = 1/6, 
              show.legend = FALSE) + 
  labs_jitter_numvote_yearcat10_02
# save
# ggsave(plot = gg_step18_jitter_02,
#        filename = "gg-step18-jitter-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step18_jitter_02

Open the gg-step18-jitter-02.png graph in the vscode IDE (above the Terminal console). Now the extreme points above 1,500,000 have been removed.

Another option is to transform the axis, which we can do with ggplot2::scale_y_log10(). This function changes the y axis scale using a base-10 log transformation.

We will also use the handy label_log10 function developed by Claus Wilke.

# click to execute code

#   load label_log10 function 
source("https://bit.ly/35Ywt2q")

#  create new labels
labs_jitter_numvote_yearcat10_03 <- labs(
  title = "Number of IMDB review votes by decade", 
  subtitle = "Internet Movie Database (IMDB)",
  caption = "*Number of votes log10 transformed; https://www.imdb.com", 
  x = "Decade", 
  y = "log10(Number of votes)")

#   create new graph
gg_step18_jitter_03 <- ImdbData %>% 
  ggplot(aes(x = year_cat10, 
             y = numVotes)) + 
  geom_jitter(size = 0.9, 
              width = 0.25,
              alpha = 1/6, 
              show.legend = FALSE) + 
    scale_y_log10(labels = label_log10) + 
  labs_jitter_numvote_yearcat10_03
#   save
# ggsave(plot = gg_step18_jitter_03,
#        filename = "gg-step18-jitter-03.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step18_jitter_03

Open the gg-step18-jitter-03.png graph in the vscode IDE (above the Terminal console). We can see the scale_y_log10() transformation spreads the points out more uniformly across the y axis, but it makes the number of votes more challenging to interpret.
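As a quick reminder of how to read a log10 axis: each step of 1 on the transformed scale is a tenfold change in the raw value. A minimal base-R illustration:

```r
# a point at 3 on a log10 axis corresponds to 10^3 = 1,000 raw votes
log10(1000)
#> [1] 3

# going the other way: back-transform an axis position to a raw count
10^3
#> [1] 1000
```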

Regardless of the choice (removing the extreme values or transforming the scale), we want to communicate these changes to our stakeholders. We also want to document them in any reports or write-ups so we don’t give readers a distorted view of the underlying variable.

Read more about transforming a scale in R for Data Science.

step 19

  • included in step19.md?
  • grammar?
  • spelling?

Average individual user ratings

The averageRating gives us the weighted average of all the individual user ratings. We will refresh our memory about this variable by taking a quick look at the skimr::skim() output below:

ImdbData$averageRating %>% 
  skimr::skim()
Data summary
Name Piped data
Number of rows 136925
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 6 1.16 1 5.3 6.1 6.8 10 ▁▂▇▆▁

The distribution of averageRating looks roughly symmetric around the mean and median, and the hist looks relatively bell-shaped. We will see if this holds when we look at averageRating across the levels of year_cat10.

We now know enough about what we’re expecting to see in the graph to build the labels for the average individual user ratings (averageRating) vs. time (year_cat10).

labs_avgratng_yearcat10 <- labs(
  title = "Average IMDB individual user ratings by decade", 
  subtitle = "Internet Movie Database (IMDB)",
  caption = "https://www.imdb.com/", 
  x = "Decade", 
  y = "Average individual user ratings")

Violin plots with geom_violin

We’re going to view the distribution of averageRating using the ggplot2::geom_violin() function. To build this plot, we will:

  • map year_cat10 to fill inside the initial ggplot(aes()) call
  • set show.legend = FALSE inside the geom_violin() function
# click to execute code
gg_step19_violin_01 <- ImdbData %>%
  ggplot(aes(x = year_cat10,
             y = averageRating,
             fill = year_cat10)) +
  geom_violin(show.legend = FALSE) + 
  # add labels
  labs_avgratng_yearcat10
# save
# ggsave(plot = gg_step19_violin_01,
#        filename = "gg-step19-violin-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step19_violin_01

Open the gg-step19-violin-01.png graph in the vscode IDE (above the Terminal console).

We can see the violin plot shows us the averageRating variable as a smoothed colored area, with tails extending out to the ends of the distribution (towards the violin’s ‘neck’). The colored areas’ width in the graph shows us that the distribution (or spread) for averageRating has increased over time.

The next section will examine how the numVotes and averageRating relate to each other over time.

step 20

  • included in step20.md?
  • grammar?
  • spelling?

Small multiples

We’ve been exploring the relationship between the number of votes and average user rating for movies in the IMDB movies dataset. So far, we’ve looked at each variable’s distribution, and we discovered the number of votes has increased over time with a skewed distribution. The average user movie rating distribution has ‘flattened’ over time, but the location (i.e., the mean and median) has remained relatively stable.

Recall that we can use skimr to check these summary statistics.

Number of votes (summary)

Below is the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for numVotes, grouped by year_cat10.

# click to execute code
ImdbData %>% 
  group_by(year_cat10) %>% 
  select(year_cat10, numVotes) %>% 
  skimr::skim() %>% 
  skimr::focus(numeric.mean, numeric.sd, 
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist)
Data summary
Name Piped data
Number of rows 136925
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables year_cat10

Variable type: numeric

skim_variable year_cat10 mean sd p0 p50 p100 hist
numVotes 1910s 223.35 1327.77 10 22 22671 ▇▁▁▁▁
numVotes 1920s 579.97 5005.06 10 27 112785 ▇▁▁▁▁
numVotes 1930s 686.31 5486.03 10 56 167259 ▇▁▁▁▁
numVotes 1940s 981.25 11114.29 10 60 520667 ▇▁▁▁▁
numVotes 1950s 1174.25 9046.01 10 78 404337 ▇▁▁▁▁
numVotes 1960s 1582.93 14908.81 10 68 687259 ▇▁▁▁▁
numVotes 1970s 1635.41 20128.38 10 69 1614179 ▇▁▁▁▁
numVotes 1980s 2717.53 30294.50 10 69 1228225 ▇▁▁▁▁
numVotes 1990s 4471.46 33454.25 10 85 1266076 ▇▁▁▁▁
numVotes 2000s 9814.88 65715.80 10 183 2334927 ▇▁▁▁▁
numVotes 2010s 11634.93 60739.55 10 273 2295898 ▇▁▁▁▁
numVotes 2020s 9034.78 49947.97 10 282 1512248 ▇▁▁▁▁

Average individual user ratings (summary)

Below is the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for averageRating, grouped by year_cat10.

# click to execute code
ImdbData %>% 
  group_by(year_cat10) %>% 
  select(year_cat10, averageRating) %>% 
  skimr::skim() %>% 
  skimr::focus(numeric.mean, numeric.sd, 
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist)
Data summary
Name Piped data
Number of rows 136925
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables year_cat10

Variable type: numeric

skim_variable year_cat10 mean sd p0 p50 p100 hist
averageRating 1910s 5.75 1.08 2.0 5.9 8.2 ▁▁▆▇▂
averageRating 1920s 6.14 1.07 1.0 6.4 9.4 ▁▁▃▇▁
averageRating 1930s 6.18 0.91 1.7 6.3 9.0 ▁▁▅▇▁
averageRating 1940s 6.19 0.84 1.1 6.3 8.7 ▁▁▂▇▁
averageRating 1950s 6.31 0.87 1.9 6.4 9.0 ▁▁▅▇▁
averageRating 1960s 6.23 1.00 1.8 6.3 9.2 ▁▁▆▇▁
averageRating 1970s 6.11 1.09 1.4 6.2 9.3 ▁▁▆▇▁
averageRating 1980s 6.06 1.13 1.1 6.2 9.3 ▁▁▆▇▁
averageRating 1990s 6.00 1.17 1.0 6.1 9.5 ▁▂▇▇▁
averageRating 2000s 5.90 1.20 1.1 6.1 9.3 ▁▂▇▇▁
averageRating 2010s 5.87 1.25 1.0 6.1 9.6 ▁▂▇▇▁
averageRating 2020s 5.83 1.28 1.0 6.0 10.0 ▁▂▇▅▁

How do you think the relationship between numVotes and averageRating will look?

Votes vs. user rating

Below we create labels and build a scatter plot with the two numeric variables (numVotes and averageRating)

# click to execute code

#   build labels
labs_avgusr_nmvote <- labs(
  title = "Number of Votes vs average individual user ratings", 
  subtitle = "Internet Movie Database (IMDB)",
  caption = "https://www.imdb.com/", 
  x = "Number of votes", 
  y = "Average individual user ratings")

#   build graph
gg_step20_scatter_01 <- ImdbData %>%
  ggplot(aes(x = numVotes,
             y = averageRating)) +
  geom_point(size = 1.2, 
              alpha = 1/5, 
              show.legend = FALSE) + 
  labs_avgusr_nmvote

# save
# ggsave(plot = gg_step20_scatter_01,
#        filename = "gg-step20-scatter-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step20_scatter_01

Open the gg-step20-scatter-01.png graph in the vscode IDE (above the Terminal console).

The scatter plot shows a majority of the points concentrated at the lower end of numVotes and the mean/median of averageRating. However, these points extend out diagonally to a few cases with high values for both the averageRating and numVotes. This pattern indicates that more votes are related to higher average user ratings for some movies.

log10(Votes) vs. user rating

Below we’re going to rebuild the scatterplot of numVotes vs. averageRating. Note that we include the log10 transformation information in multiple places so we don’t confuse our audience.

# click to execute code

#   build labels 
labs_avgusr_lognmvote <- labs(
  title = "*Number of Votes vs average individual user ratings", 
  subtitle = "Internet Movie Database (https://www.imdb.com/)",
  caption = "*Number of votes has been log10 transformed", 
  y = "Average individual user ratings",
  x = "log10(Number of votes)")

#   build plot
gg_step20_scatter_02 <- ImdbData %>%
  ggplot(aes(x = numVotes,
             y = averageRating)) +
  geom_point(size = 1.2, 
              alpha = 1/6, 
              show.legend = FALSE) + 
  # add log10 transformation
    scale_x_log10(labels = label_log10) +
  # add label
  labs_avgusr_lognmvote

# save
# ggsave(plot = gg_step20_scatter_02,
#        filename = "gg-step20-scatter-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")

gg_step20_scatter_02

Open the gg-step20-scatter-02.png graph in the vscode IDE (above the Terminal console).

We can see the log10 transformation of numVotes has shifted the points across the new x axis scale. The shape of the distribution shows a trend similar to the scatterplot above, but with much less over-plotting.

Facetting for multiple comparisons

We want to make direct comparisons across time by placing an individual plot for each decade within the same view. The best way to accomplish this is by using small multiples, where we repeat the same graph for each snapshot of time and present them in a grid.

The functions for creating small multiples in ggplot2 are facet_wrap() or facet_grid(). We will demonstrate this below using the former, and we also make some adjustments to the x axis to make it easier to read:

inside the ggplot2::scale_x_continuous() function:

  • set limits to 10 (the minimum) and 2334927 (the maximum)
  • the breaks argument specifies where we want to ‘break up’ our x axis; we will use the minimum, 1167464 (0.5 x the maximum), and the maximum
  • the labels argument specifies the text to display at each break

The ggplot2::facet_wrap() function uses a ~ followed by the categorical variable (year_cat10) we’ve specified in the ggplot(aes(group = )) argument.

Now we can build our labels and small multiples graph.

# click to execute code

#   build labels 
labs_avgusr_nmvote_yearcat10 <- labs(
  title = "Number of votes vs average individual user ratings over time", 
  subtitle = "Internet Movie Database (IMDB)",
  caption = "https://www.imdb.com", 
  y = "Average individual user ratings",
  x = "Number of votes")


gg_step20_facet_01 <- ImdbData %>%
  ggplot(aes(x = numVotes,
             y = averageRating,
             group = year_cat10)) +
  geom_point(size = 0.5, 
              alpha = 1/10, 
              show.legend = FALSE) + 
  # add x scale attributes
  scale_x_continuous(limits = c(10, 2334927),
                     breaks = c(10, 1167464, 2334927), 
                     labels = c('10', '~1.2M', '~2.3M')) +
  # add facet
    facet_wrap(~ year_cat10) + 
  labs_avgusr_nmvote_yearcat10

# save
# ggsave(plot = gg_step20_facet_01,
#        filename = "gg-step20-facet-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")

gg_step20_facet_01

Open the gg-step20-facet-01.png graph in the vscode IDE (above the Terminal console).

We can see how small multiples allow for more incisive comparisons. The relationship between votes and average user rating becomes slightly more evident as time progresses. A few data points approach the 8.75 line in the 1960s and 1970s, and there appears to be a slight upward trend in average user rating from the 1990s through the 2020s.

Now we will use small multiples on the log10 transformed numVotes variable, which means we need to update our labels, add the scale_x_log10() layer, and use facet_wrap() with year_cat10.

# click to execute code

#   build labels 
labs_avgusr_lognmvote_yearcat10 <- labs(
  title = "*Number of votes vs average individual user ratings over time", 
  subtitle = "Internet Movie Database (https://www.imdb.com/)",
  caption = "*Number of votes has been log10 transformed", 
  y = "Average individual user ratings",
  x = "log10(Number of votes)")

gg_step20_facet_02 <- ImdbData %>%
  ggplot(aes(x = numVotes,
             y = averageRating,
             group = year_cat10)) +
  geom_point(size = 0.5, 
             alpha = 1/10, 
             show.legend = FALSE) + 
  # add log10 x axis
    scale_x_log10(labels = label_log10) +
  # facet wrap on decades
    facet_wrap(~ year_cat10) + 
  # add labels
  labs_avgusr_lognmvote_yearcat10

# save
# ggsave(plot = gg_step20_facet_02,
#        filename = "gg-step20-facet-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")

gg_step20_facet_02

Open the gg-step20-facet-02.png graph in the vscode IDE (above the Terminal console).

Now the small multiples show the relationship between the log10 transformed number of votes and average user rating. The majority of the data points appear concentrated around the same value for average user rating (~6.125).

Adding the best fit line (geom_smooth)

We can use ggplot2’s geom_smooth() function to draw a smoothed ‘best-fit’ line through the data points in each decade. Modeling is beyond the scope of this scenario, but feel free to read more about building models in this chapter of R for Data Science.

For now, know the following:

  • method = 'lm' tells geom_smooth to fit the best straight line through the data points
  • size = 0.75 is the size of the line
  • color = "firebrick2" will make the line color red
  • fullrange = TRUE will draw the line through the entire range of the graph
# click to execute code

#   build labels 
labs_avgusr_lognmvote_yearcat10_lm <- labs(
  title = "*Number of votes vs average individual user ratings over time", 
  subtitle = "Internet Movie Database (https://www.imdb.com/)",
  caption = "*Number of votes has been log10 transformed; lm smooth line", 
  y = "Average individual user ratings",
  x = "log10(Number of votes)")

gg_step20_facet_03 <- ImdbData %>%
  ggplot(aes(x = numVotes,
             y = averageRating,
             group = year_cat10)) +
  geom_point(size = 0.5, 
              alpha = 1/10, 
              show.legend = FALSE) + 
  # add smoothed line
  geom_smooth(method = 'lm',
              size = 0.75,
              color = "firebrick2",
              fullrange = TRUE) +
  # add x axis attributes
    scale_x_log10(labels = label_log10) +
  # add facets
    facet_wrap(~ year_cat10) + 
  # add labels
  labs_avgusr_lognmvote_yearcat10_lm

# save
# ggsave(plot = gg_step20_facet_03,
#        filename = "gg-step20-facet-03.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")

gg_step20_facet_03
#> `geom_smooth()` using formula 'y ~ x'

Open the gg-step20-facet-03.png graph in the vscode IDE (above the Terminal console). We can see the relationship for the log10 transformed number of votes vs. average user rating is slightly positive but has become less pronounced over time. The gray band is the standard error associated with the red smoothed line we’ve drawn (fewer data points = more error).

finish

  • included in finish.md?
  • grammar?
  • spelling?

Communicating relationships

A graph describing a relationship answers a certain kind of question. These questions center around topics like divergence (“as x increases/decreases, y moves up/down”), correlations (“how much does y increase/decrease with each unit of x?”), or time series (“how does x change over time?”).
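To put a number on a correlation question like the ones above, cor() summarizes the strength of a linear relationship. A hedged sketch on simulated data (the variables here are placeholders, not the IMDB columns):

```r
set.seed(42)
# simulate an x/y pair where y rises with x, plus some noise
x <- 1:50
y <- 2 * x + rnorm(50, mean = 0, sd = 5)

# Pearson correlation: a value close to 1 means a strong positive relationship
cor(x, y)
```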

When presenting a graph of a relationship, consider the context and framing for the conclusions your audience will draw. Is this good news? For example, if the chart displays a drop in sales over time, anticipate how this will change your presentation’s tone, and be ready to answer questions. It’s also important not to confuse your audience when designing graphs. Relationships with ‘good news’ should show the data points trending positively (i.e., as x values increase, so do y values) and vice-versa. You don’t want to find yourself explaining that your graph doesn’t show what your audience is seeing.

If the relationship is sufficiently complex, consider using small multiples to compare the information in a series of graphs. Pay close attention to the x and y axes when building small multiple plots, so the audience knows what comparisons to make.
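In ggplot2, the panels of a small-multiples plot share their x and y axes by default, which is what makes cross-panel comparisons safe. A sketch of the relevant facet_wrap() argument, using the built-in mpg data purely for illustration:

```r
library(ggplot2)

# shared axes (the default, scales = "fixed") keep panels comparable
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)

# scales = "free" gives every panel its own axes; use with care,
# because it invites misleading cross-panel comparisons
p_free <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, scales = "free")
```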

Communication (outliers)

If there are extreme values in your data, investigate why these outliers exist. Determine if you should include the outliers, or, if you decide to remove them, document this justification and communicate it clearly with your audience/stakeholders.
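One concrete way to investigate extreme values before deciding anything is to pull the offending rows out and look at them. A minimal base-R sketch on a toy data frame (the column names mirror the IMDB example, but the values are made up):

```r
# toy data with one extreme vote count
films <- data.frame(
  primaryTitle = c("Film A", "Film B", "Film C"),
  numVotes     = c(120, 95, 2000000)
)

# isolate rows beyond a chosen threshold and inspect them
# before deciding whether to remove, transform, or keep them
outliers <- films[films$numVotes >= 1500000, ]
outliers
```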

Read more about visualizing missing data here and on the visdat package site, or on the inspectdf package website.

In the StarWars dataset, we saw how removing Jabba the Hutt changed the relationship between mass and height.

Communication (scales)

We should make sure the scales in our graphs are presented clearly and in the appropriate units. For example, when we calculated the percentage of female characters for each comic book publisher, we made sure the y axis displayed the appropriate units.

We also used text annotations to highlight and label the details we wanted our audience to focus on.

Communication (transformation)

If we find a variable skewed and need to transform it, we should make sure the stakeholders understand what we’ve done. We also need to communicate the additional effort to interpret the results.

We used the log10 transformation on the number of votes in the IMDB dataset to examine its relationship to average user ratings.

We should include accompanying information on interpreting the x axis and any other graphs or tables built with the transformed variable.

Thank you!

We’ve concluded ‘Combine narrative with numbers for effective storytelling in R’! Thank you for completing this scenario, and be sure to check out the other scenarios on Katacoda.